Accuracy improvement for identifying translation initiation sites in microbial genomes
Abstract
MOTIVATION: Current computational gene-finding methods for microbial genomes predict verified translation termination sites (3' end) with high accuracy, but predict the translation initiation site (TIS, 5' end) with much lower accuracy. The TIS is important for analyzing and understanding the putative protein product of a gene and the regulatory machinery of translation. Improving the accuracy of TIS prediction is one of the remaining open problems.
RESULTS: In this paper, we develop a four-component statistical model to describe the TIS of prokaryotic genes. The model incorporates several biologically meaningful features: the correlation between the translation termination site and the TIS of genes, the sequence content around the start codon, the sequence content of the consensus signal related to ribosomal binding sites (RBSs), and the correlation between the TIS and the upstream consensus signal. An entirely non-supervised training system is constructed, which takes as input a set of coding open reading frames (ORFs) annotated by any gene finder and gives as output a set of organism-specific parameters (without any prior knowledge or empirical constants and formulas). The novel algorithm is tested on a set of reliable datasets of genes from Escherichia coli and Bacillus subtilis. MED-Start correctly predicts 95.4% of the start sites of 195 experimentally confirmed E. coli genes and 96.6% of 58 reliable B. subtilis genes. Moreover, the test results indicate that the algorithm gives higher accuracy for more reliable datasets and is robust to variation in gene length. MED-Start may be used as a postprocessor for a gene finder. After processing by our program, the gene start prediction of a gene finder improves markedly: for example, the accuracy of TIS predicted by MED 1.0 increases from 61.7 to 91.5% for 854 verified E. coli genes, while that of GLIMMER 2.02 increases from 63.2 to 92.0% for the same dataset. These results show that our algorithm is one of the most accurate methods to identify TIS of prokaryotic genomes.
AVAILABILITY: The program MED-Start can be accessed through the website of CTB at Peking University: http://ctb.pku.edu.cn/main/SheGroup/MED_Start.htm.
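As a rough illustration of the search space that a TIS predictor has to rank, the sketch below lists the candidate start codons for one ORF whose stop codon is already known. It is not the MED-Start model and does no statistical scoring. The function name `candidate_starts`, the choice of ATG/GTG/TTG as start codons, and the toy sequence are assumptions made for this example only.

```python
# Hypothetical helper, not part of MED-Start: enumerate candidate TISs
# for an ORF on the forward strand, given the position of its stop codon.

START_CODONS = {"ATG", "GTG", "TTG"}   # assumed set of bacterial start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}


def candidate_starts(seq, stop_pos):
    """Return 0-based positions of in-frame start codons upstream of the
    stop codon at stop_pos, scanning back to the previous in-frame stop
    codon (or to the start of the sequence)."""
    seq = seq.upper()
    if seq[stop_pos:stop_pos + 3] not in STOP_CODONS:
        raise ValueError("stop_pos does not point at a stop codon")

    candidates = []
    pos = stop_pos - 3
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break                      # the ORF cannot extend past this stop
        if codon in START_CODONS:
            candidates.append(pos)
        pos -= 3                       # stay in the same reading frame
    return sorted(candidates)


# Toy sequence: TAA ATG AAA GTG CCC TGA
toy = "TAAATGAAAGTGCCCTGA"
print(candidate_starts(toy, 15))
```

A real predictor would then score each candidate, for example with the sequence content around the start codon and an upstream RBS signal, and report the highest-scoring position.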
Related articles
GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions.
Improving the accuracy of prediction of gene starts is one of a few remaining open problems in computer prediction of prokaryotic genes. Its difficulty is caused by the absence of relatively strong sequence patterns identifying true translation initiation sites. In the current paper we show that the accuracy of gene start prediction can be improved by combining models of protein-coding and non-...
A probabilistic method for identifying start codons in bacterial genomes
As the pace of genome sequencing has accelerated, the need for highly accurate gene prediction systems has grown. Computational systems for identifying genes in prokaryotic genomes have sensitivities of 98-99% or higher (Delcher et al., Nucleic Acids Res., 27, 4636-4641, 1999). These accuracy figures are calculated by comparing the locations of verified stop codons to the predictions. Determini...
Prediction of translation initiation site for microbial genomes with TriTISA
We report a new and simple method, TriTISA, for accurate prediction of translation initiation site (TIS) of microbial genomes. TriTISA classifies all candidate TISs into three categories based on evolutionary properties, and characterizes them in terms of Markov models. Then, it employs a Bayesian methodology for the selection of true TIS with a non-supervised, iterative procedure. A...
A Novel Quality Measure and Correction Procedure for the Annotation of Microbial Translation Initiation Sites
The identification of translation initiation sites (TISs) constitutes an important aspect of sequence-based genome analysis. An erroneous TIS annotation can impair the identification of regulatory elements and N-terminal signal peptides, and also may flaw the determination of descent, for any particular gene. We have formulated a reference-free method to score the TIS annotation quality. The me...
Prediction of translation initiation sites on the genome of Synechocystis sp. strain PCC6803 by Hidden Markov model.
We developed a computer program, GeneHackerTL, which predicts the most probable translation initiation site for a given nucleotide sequence. The program requires that information be extracted from the nucleotide sequence data surrounding the translation initiation sites according to the framework of the Hidden Markov Model. Since the translation initiation sites of 72 highly abundant proteins h...
Journal: Bioinformatics
Volume 20, Issue 18
Publication date: 2004